August 23rd 2019

Who?

  • Leon Eyrich Jessen

  • Assistant Professor of Bioinformatics

  • Section of Bioinformatics, Department of Health Technology, Technical University of Denmark

  • Research group: Machine Learning and Immunoinformatics

  • Focused on understanding immune activation in vaccinology, be it in infectious disease or cancer

  • Focus is on development and application of machine learning methods for understanding the interaction between the MHC, the peptide and the TCR (more on this later)

Part I: Tidysq

We are amidst a data revolution

  • In just the past 5 years, the cost of sequencing a human genome has dropped approximately 10-fold

  • Development moves equally fast within areas such as mass spectrometry and in vitro immuno-peptide screening, among others

  • This facilitates the search for bio-markers, biologics, therapeutics, etc. but also redefines the requirements for storing, accessing and working with data

Tidysq - Work in Progress

  • The aim of tidysq is to adapt the design philosophy, grammar, and data structures of the tidyverse to biological sequence data

  • Thereby accessing the plethora of tidyverse tools available for application to sequence data

  • With an emphasis on overcoming R object size challenges to allow laptop analysis of big NGS (and similar) data sets

  • NB! What I am presenting here today, is work-in-progress
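
The size savings come from bit-packing: a 4-letter DNA alphabet needs only 2 bits per base instead of the 8 bits a plain character takes. A minimal base-R sketch of the idea (tidysq's actual packed representation may well differ):

```r
# Illustrative only: pack a DNA string at 2 bits per base,
# i.e. 4 bases per byte -- the core idea behind compressed storage.
pack_dna <- function(seq) {
  bases  <- strsplit(seq, "")[[1]]
  codes  <- c(A = 0L, C = 1L, G = 2L, T = 3L)[bases]  # 2-bit codes
  n      <- length(codes)
  packed <- integer(ceiling(n / 4))                   # 4 bases per byte
  for (i in seq_len(n)) {
    byte <- (i - 1) %/% 4 + 1
    off  <- ((i - 1) %% 4) * 2
    packed[byte] <- bitwOr(packed[byte], bitwShiftL(codes[i], off))
  }
  as.raw(packed)
}

length(pack_dna(strrep("ACGT", 25)))  # 100 bases -> 25 bytes
```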

First step: Reading sequence data

  • The standard bio-sequence format FASTA looks like this:

  • Each sequence has an identifier line starting with ">", followed by the (possibly multiline) sequence

  • Obviously, this is not a standard row x column format

>AMY1|K19|T-Protein (Tau)
PGGGKVQIVYKPV
>AMY9|K19Gluc41|T-Protein (Tau)
NLKHQPGGGKVQIVYKPVDLSKVTSKCGSLGNIHHKPGGGQVE
>AMY14|K19Gluc782|T-Protein (Tau)
NLKHQPGGGKVQIVYKEVD
>AMY17|PHF8|T-Protein (Tau)
GKVQIVYK
>AMY18|PHF6|T-Protein (Tau)
VQIVYK
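
For illustration, a simplified base-R reader (a hypothetical helper, no validation) showing how such a file maps onto the two-column structure that read_fasta() returns:

```r
# Simplified FASTA parser: header lines start a new record,
# non-header lines are concatenated into that record's sequence.
parse_fasta <- function(lines) {
  is_header <- startsWith(lines, ">")
  id        <- cumsum(is_header)                # record number per line
  name      <- sub("^>", "", lines[is_header])
  sq <- vapply(split(lines[!is_header], id[!is_header]),
               paste0, character(1), collapse = "")  # join multiline seqs
  data.frame(name = name, sq = unname(sq))
}

fasta <- c(">AMY18|PHF6|T-Protein (Tau)", "VQIVYK")
parse_fasta(fasta)$sq  # "VQIVYK"
```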

First step: Reading sequence data

seq_dat = read_fasta(file = "data/seqs.fasta")
seq_dat
## # A tibble: 5 x 2
##   name                             sq                      
##   <chr>                            <(c)ami>                
## 1 AMY1|K19|T-Protein (Tau)         PGGGKVQIVYKPV       <13>
## 2 AMY9|K19Gluc41|T-Protein (Tau)   NLKHQPGGGKVQIVY...  <43>
## 3 AMY14|K19Gluc782|T-Protein (Tau) NLKHQPGGGKVQIVY...  <19>
## 4 AMY17|PHF8|T-Protein (Tau)       GKVQIVYK             <8>
## 5 AMY18|PHF6|T-Protein (Tau)       VQIVYK               <6>
  • The FASTA file is split into two columns: the (observation) id and the sequence (variable)

  • So, tidy data in a tibble, that’s nice and familiar

  • Note the special data type for the sequences, with a length indicator

Second step: Reducing data size

Find Motifs

  • Implementing various sequence manipulation / exploration tools, capable of working on the compressed format
seq_dat %>% mutate(has_GG_motif = sq %has% "GG")
## # A tibble: 5 x 3
##   name                             sq                       has_GG_motif
##   <chr>                            <(c)ami>                 <lgl>       
## 1 AMY1|K19|T-Protein (Tau)         PGGGKVQIVYKPV       <13> TRUE        
## 2 AMY9|K19Gluc41|T-Protein (Tau)   NLKHQPGGGKVQIVY...  <43> TRUE        
## 3 AMY14|K19Gluc782|T-Protein (Tau) NLKHQPGGGKVQIVY...  <19> TRUE        
## 4 AMY17|PHF8|T-Protein (Tau)       GKVQIVYK             <8> FALSE       
## 5 AMY18|PHF6|T-Protein (Tau)       VQIVYK               <6> FALSE

Advanced motifs

  • Implementing various sequence manipulation / exploration tools, capable of working on the compressed format
seq_dat %>% filter(sq %has% "^PXG")
## # A tibble: 1 x 2
##   name                     sq                 
##   <chr>                    <(c)ami>           
## 1 AMY1|K19|T-Protein (Tau) PGGGKVQIVYKPV  <13>
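
The X in the motif stands for "any amino acid". A base-R sketch of the semantics (tidysq's %has% may be implemented differently on the compressed format), translating X into the regex wildcard:

```r
# Motif matching with X = "any residue": translate X to the
# regex wildcard "." and delegate to grepl().
motif_has <- function(sq, motif) {
  grepl(gsub("X", ".", motif, fixed = TRUE), sq)
}

motif_has(c("PGGGKVQIVYKPV", "VQIVYK"), "^PXG")  # TRUE FALSE
```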

Sequence Encoding

  • For application in machine learning, we need to convert biology to a numerical representation

  • This can be done using e.g. the hydropathy index (Kyte & Doolittle, 1982):

seq_dat  %>% 
  encode(sq = sq, 
         encoding = AAindex_norm["KYTJ820101",])
  • Other encoding schemes, e.g. BLOSUM will follow
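
As a sketch of what such an encoding does, here are the raw Kyte-Doolittle hydropathy indices mapped over a peptide; the actual encode() call uses a normalised AAindex table, but the principle is the same:

```r
# Raw Kyte-Doolittle (1982) hydropathy indices per residue.
kd <- c(A =  1.8, R = -4.5, N = -3.5, D = -3.5, C =  2.5,
        Q = -3.5, E = -3.5, G = -0.4, H = -3.2, I =  4.5,
        L =  3.8, K = -3.9, M =  1.9, F =  2.8, P = -1.6,
        S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V =  4.2)

# Map each residue of a peptide to its hydropathy value
encode_kd <- function(pep) unname(kd[strsplit(pep, "")[[1]]])

encode_kd("VQIVYK")  # 4.2 -3.5 4.5 4.2 -1.3 -3.9
```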

Export and import tidysq objects

  • We are working on interfacing tidysq with ape, seqinr and Biostrings

  • Aiming at enabling easy sequence manipulation / exploration in the tidyverse paradigm and then export to harvest the power of existing packages

Tidysq Credit

  • Michal Burdukiewicz (1)
  • Dominik Rafacz (1)
  • Weronika Puchala (1)
  • Filip Pietluch (1)
  • Katarzyna Sidorczuk (1)
  • Stefan Roediger (2)
  • Leon Eyrich Jessen (3)
  1. Warsaw University of Technology, Warsaw, Poland
  2. Brandenburg University of Technology, Cottbus, Germany
  3. Technical University of Denmark, Lyngby, Denmark

Part II: ML Driven Epitope Prediction in Cancer Immunotherapy

Long story short

  • A key component in cancer immunotherapy is identifying targets on the tumour surface

  • It is then possible to mobilise the patient's own immune system against these targets

  • These targets arise due to mutations and are known as neo-epitopes

  • If we can find good targets, we can
    • engineer special cells to kill the cancer (CAR T-cells)
    • see if the patient already has such cells, but in too low abundance
  • The targets are small fragments of proteins, known as peptides, and consist of 9 amino acid residues

  • They are presented on the surface of the cell by the MHC Class I molecule and it is this pMHC interaction we are interested in

  • Possible number of targets is 20^9 = 5.12e+11

  • Not feasible to screen all targets

  • We need a model to suggest probable targets for lab tests
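
For the record, the combinatorics behind that number: with 20 possible residues at each of 9 positions, the alphabet size is the base and the peptide length the exponent, not the other way around:

```r
# 20 amino acids at each of 9 positions:
n_targets <- 20^9
format(n_targets, scientific = FALSE, big.mark = ",")  # "512,000,000,000"
```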

The netMHC suite of prediction tools

  • Our research group was the first to create such prediction tools and kick-started the field ~20 years ago

  • Flagship state-of-the-art prediction servers:

  • MHC Class I ligands: NetMHCpan 4.0: Prediction of peptide-MHC class I binding using artificial neural networks prediction server, method paper

  • MHC Class II ligands: NetMHCIIpan 3.2 Server: Prediction of peptide-MHC class II binding using artificial neural networks prediction server, method paper

Example

  • Let us say we have some data on peptides which are either strong binders, weak binders or non-binders:
pep_dat
## # A tibble: 23,760 x 4
##    peptide   label_chr label_num data_type
##    <chr>     <chr>         <dbl> <chr>    
##  1 LLTDAQRIV WB                1 train    
##  2 LMAFYLYEV SB                2 train    
##  3 VMSPITLPT WB                1 test     
##  4 SLHLTNCFV WB                1 train    
##  5 RQFTCMIAV WB                1 train    
##  6 HQRLAPTMP NB                0 train    
##  7 FMNGHTHIA SB                2 train    
##  8 KINPYFSGA WB                1 train    
##  9 WLLIFHHCP NB                0 train    
## 10 NIWLAIIEL WB                1 train    
## # … with 23,750 more rows

Example

  • We can then split the data into a training and a test set
pep_dat %>% count(label_chr, data_type)
## # A tibble: 6 x 3
##   label_chr data_type     n
##   <chr>     <chr>     <int>
## 1 NB        test        782
## 2 NB        train      7138
## 3 SB        test        802
## 4 SB        train      7118
## 5 WB        test        792
## 6 WB        train      7128
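
One way such a split could be produced (the actual pep_dat split may have been made differently) is a random ~90/10 assignment per peptide:

```r
# Randomly assign each of the 23,760 peptides to train or test,
# with roughly 90% train / 10% test, matching the counts above.
set.seed(1)
n         <- 23760
data_type <- sample(c("train", "test"), size = n,
                    replace = TRUE, prob = c(0.9, 0.1))
round(prop.table(table(data_type)), 2)
```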

Example

  • and convert the peptides to a numerical representation using the BLOSUM62 matrix
pep_dat = pep_dat %>%
  mutate(peptide_enc = peptide %>% encode_peptide(m = BLOSUM62))
  • 20 numbers per amino acid, such that a 9-mer peptide becomes 180 encoded variables

  • The advantage of BLOSUM62 is that it is evolutionarily derived, so it carries natural information on amino acid similarity

  • This is superior to one-hot encoding, where all numeric representations are orthogonal
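
For contrast, a one-hot encoder is easy to sketch: each residue becomes a 20-long indicator vector, so any two distinct residues are equally dissimilar. BLOSUM62 encoding replaces that indicator with the residue's row of substitution scores, so similar residues get similar vectors:

```r
# One-hot encoding: a 20-long indicator vector per residue,
# concatenated across positions (9-mer -> 180 variables).
aa <- strsplit("ARNDCQEGHILKMFPSTWYV", "")[[1]]

one_hot <- function(pep) {
  res <- strsplit(pep, "")[[1]]
  as.vector(vapply(res, function(r) as.numeric(aa == r), numeric(20)))
}

length(one_hot("VQIVYK"))  # 6 residues x 20 = 120 variables
```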

Example

  • and train an artificial neural network using TensorFlow via Keras

  • once the network is trained, we can evaluate the performance
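
The architecture used is not shown here, but a minimal keras-in-R sketch of this kind of network could look like the following (layer sizes are illustrative; x_train and y_train are placeholders for the encoded data above):

```r
# A sketch only -- not the actual model from the talk.
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 90, activation = "relu", input_shape = 180) %>%
  layer_dense(units = 3, activation = "softmax")  # NB / WB / SB

model %>% compile(
  optimizer = "adam",
  loss      = "categorical_crossentropy",
  metrics   = "accuracy"
)

# model %>% fit(x_train, y_train, epochs = 30, validation_split = 0.1)
```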

Network Performance

So, that’s it…

  • …now we are ready to conquer the world

  • but wait…

  • what actually happened here?

The infamous “black box”

  • I hear it a lot: “Neural nets are black boxes”

  • So what did it learn?

  • What are the important characteristics of strong-binding peptides?

  • Which features carry the binding signal?

  • How can we extract biological insights from the model?

Open the “black box”

  • Well, now that we have a working model…

  • We could use the model we created to predict a large set of new peptides

  • and then, using information theory, see what the model has learned
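
The standard information-theoretic summary here is the per-position information content used in sequence logos, IC_p = log2(20) - H_p, where H_p is the Shannon entropy of the residue frequencies at position p among predicted binders. A small sketch:

```r
# Information content (bits) at one peptide position:
# log2(20) minus the Shannon entropy of the observed residues.
position_ic <- function(peptides, pos) {
  res  <- substr(peptides, pos, pos)
  freq <- table(res) / length(res)
  log2(20) + sum(freq * log2(freq))  # log2(20) - entropy
}

peps <- c("AL", "AV", "AI", "AM")   # toy 2-mers
position_ic(peps, 1)  # ~4.32 bits: fully conserved position
position_ic(peps, 2)  # ~2.32 bits: 4 residues, each at 25%
```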

Open the “black box”

  • We can see that there is a clear signal at the 2nd and 9th positions

  • Now we can go back and look at the MHC to understand the biology

  • and if we look at an X-ray crystallography protein structure, we can indeed see that positions 2 and 9 anchor the peptide

  • This also means that we could revisit the model creation, focusing only on positions 2 and 9

Adapted from Andreatta & Nielsen, 2015

This was brief, but there are many more details…

Summary

  • The tidysq package

    • Aim to adapt the design philosophy, grammar, and data structures of the tidyverse to biological sequence data

    • enables accessing the plethora of tidyverse tools available for application to sequence data

  • Using neural networks we can

    • identify possible targets in cancer immunotherapy

    • learn about the underlying biology

Acknowledgements

  • Professor Morten Nielsen

  • The Immunoinformatics and Machine Learning Group at DTU Healthtech and IIB-UNSAM

  • All the hard-working PhD students, post-docs and colleagues

  • Primary Collaborators
    • Søren Buus, IMMI, UCPH: MHC binding/Tetramers
    • A. Sette, B. Peters. La Jolla Institute of Allergy and Infectious Diseases: Immune Epitope database
    • Nicola Ternette, Nuffield Department of Medicine, Oxford
  • My online presence: Twitter, LinkedIn, GitHub